Enhancing Chinese Word Segmentation Using Unlabeled Data
Authors
Abstract
This paper investigates how unlabeled data can improve the accuracy of supervised word segmentation. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution for incorporating features derived from unlabeled data into a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea of transductive, document-level segmentation, which is designed to improve system recall for out-of-vocabulary (OOV) words that appear more than once in a document. The novel features result in relative error reductions of 13.8% and 15.4% in terms of F-score and OOV recall, respectively.
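The document-level idea in the abstract can be illustrated with a small sketch. It is not the paper's actual feature set: the function names `repeated_ngrams` and `longest_repeat_at`, the n-gram length limit, and the example sentence are all assumptions made for illustration. The sketch counts character n-grams that recur within a single unsegmented document, then reports, for a given character position, the length of the longest recurring n-gram that starts there. A character-based segmenter could consume such a value as one extra feature.

```python
from collections import Counter


def repeated_ngrams(document: str, max_len: int = 4, min_count: int = 2) -> dict[str, int]:
    """Return character n-grams (length 2..max_len) seen at least min_count times.

    Occurrences are counted at every start position, so overlapping
    matches are counted separately.
    """
    counts: Counter[str] = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(document) - n + 1):
            counts[document[i:i + n]] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def longest_repeat_at(document: str, i: int, repeated: dict[str, int], max_len: int = 4) -> int:
    """Length of the longest repeated n-gram starting at position i, or 0 if none."""
    for n in range(max_len, 1, -1):
        # Skip lengths that would run past the end of the document.
        if i + n <= len(document) and document[i:i + n] in repeated:
            return n
    return 0


# Hypothetical two-sentence document in which one name occurs twice.
doc = "奥巴马访问北京。奥巴马发表讲话。"
repeated = repeated_ngrams(doc)
features = [longest_repeat_at(doc, i, repeated) for i in range(len(doc))]
```

Here the recurring name and its two-character substrings are the only n-grams that survive the `min_count` filter, so positions inside the name get a non-zero feature value while the rest of the document gets 0.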
Similar resources
Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation
This paper presents a semi-supervised Chinese word segmentation (CWS) approach that co-regularizes character-based and word-based models. Similarly to multi-view learning, the “segmentation agreements” between the two different types of view are used to overcome the scarcity of the label information on unlabeled data. The proposed approach trains a character-based and word-based model on labele...
Exploiting Unlabeled Text with Different Unsupervised Segmentation Criteria for Chinese Word Segmentation
This paper presents a novel approach to improve Chinese word segmentation (CWS) that attempts to utilize unlabeled data such as training and test data without annotation for further enhancement of the state-of-the-art performance of supervised learning. The lexical information plays the role of information transformation from unlabeled text to supervised learning model. Four types of unsupervis...
Exploiting unlabeled internal data in conditional random fields to reduce word segmentation errors for Chinese texts
The application of text-to-speech (TTS) conversion has become widely used in recent years. Chinese TTS faces several unique difficulties. The most critical is caused by the lack of word delimiters in written Chinese. This means that Chinese word segmentation (CWS) must be the first step in Chinese TTS. Unfortunately, due to the ambiguous nature of word boundaries in Chinese, even the best CWS s...
Enhancing LSTM-based Word Segmentation Using Unlabeled Data
Word segmentation is widely treated as a sequence labeling problem. The traditional approach to this kind of problem is a machine learning method such as a conditional random field with hand-crafted features. Recently, deep learning approaches have achieved state-of-the-art performance on the word segmentation task, and a popular method among them is the LSTM network. This paper gives a method to introduce nu...
Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
Micro-blogs are a new kind of medium that is short and informal. Since no segmented corpus of micro-blogs is available to train a Chinese word segmentation model, existing Chinese word segmentation tools cannot perform as well on them as on ordinary news texts. In this paper we present an effective yet simple approach to Chinese word segmentation of micro-blogs. In our approach, we incorporate punctua...